Gemcitabine is a nucleoside analogue used as chemotherapy. It is marketed as Gemzar. The drug replaces one of the building blocks of nucleic acids, in this case cytidine, during DNA replication. The process arrest tumor growth because gemcitabine induces mast chain termination. Gemcitabine is used in various carcinomas: non-small cell lung cancer, pancreatic cancer, bladder cancer and breast cancer. It is being investigated for use in oesophageal cancer, and is used experimentally in lymphomas and various other tumor types.
Loading the data
#Set wd to the working directory where this Rscript is located
wd = getwd()
#Load data
basalexpression <- readRDS(paste0(wd,"/CCLE_basalexpression.rds"))
copynumber <- readRDS(paste0(wd,"/CCLE_copynumber.rds"))
mutations <- readRDS(paste0(wd,"/CCLE_mutations.rds"))
treated <- readRDS(paste0(wd,"/NCI_TPW_gep_treated.rds"))
untreated <- readRDS(paste0(wd,"/NCI_TPW_gep_untreated.rds"))
drugsensitivity <- readRDS(paste0(wd,"/NegLogGI50.rds"))
cannotation <- read.table(paste0(wd,"/cellline_annotation.tsv"), header = TRUE,
sep ="\t", stringsAsFactors = TRUE)
dannotation <- read.table(paste0(wd,"/drug_annotation.tsv"), header = TRUE,
sep ="\t", stringsAsFactors = TRUE)
metadata <- read.table(paste0(wd,"/NCI_TPW_metadata.tsv"), header = TRUE,
sep ="\t", stringsAsFactors = TRUE)
#scale treated & untreated
untreated_normalized <- scale(untreated)
treated_normalized <- scale(treated)
library(viridis)
library(car)
library(BBmisc)
library(plyr)
library(dplyr)
library(qpcR)
library(ggplot2)
library(pheatmap)
library(ComplexHeatmap)
library(circlize)
library(colorspace)
library(GetoptLong)
The treated data is illustrated with a boxplot as follows.
boxplot(treated, main= "Gene Expression of treated celllines",
ylab="Gene expression", xlab="Samples", names = FALSE)
Numerous batches are observed in the above plot. About 15 of them are counted. This number also corresponds to the number of drugs used. The boxplot is coloured according to the drug used in the following diagram to investigate the nature of the batches.
palette(viridis(15))
drugnames <- metadata$drug
par(mar=c(5,4,5,9))
boxplot(treated,medcol= "black", border=drugnames, col=drugnames, xlab="Samples/ Cell Lines",
ylab="Gene expression", main="Gene Expression of treated celllines", names=FALSE,
xaxt="n", boxwex=1, boxlty=0)#
levels=as.factor(levels(drugnames))
legend("topright",inset=c(-0.4,0), legend=levels(drugnames),xpd=TRUE,pch=19,
col=levels,title="drugs")
As it can be seen from the above diagram, each batch corresponds to a particular drug treatment. It can also be noted that the mean of the boxplots are not levelled. This indicates the lack of scaling. Thus, the data is scaled and a boxplot is generated again.
boxplot(treated_normalized, main = "Scaled gene expression after treatment",
ylab = "Gene expression", xlab = "Samples/ Cell Lines", names = FALSE)
The levelled mean indicates scaling.
PCA is a good method to reduce dimensionality of data. Here we carry out a PCA of the fold change so the difference in gene expression after drug treatment. The fold change is calculated by substracting the untreated from the treated data. We also want to see whether it is tissue associated as well.
#First of all we calculate the change of gene expression after treatment
fcgeneexpression <- treated_normalized - untreated_normalized
pca_FC = prcomp(fcgeneexpression, center = F, scale. = F)
pca_untreated = prcomp(untreated_normalized, center = F, scale. = F)
#plot PC 1 and 2 of FC associated to the drug
drug_as_factor <- as.factor(metadata[1:ncol(treated),"drug"])
levels <- as.factor(levels(drug_as_factor))
palette(rainbow(15))
par(xpd=T, mar=par()$mar+c(0,0,0,6))
plot(pca_FC$rotation[, 1], pca_FC$rotation[, 2], pch = 20, xlab = "PC 1",
ylab = "PC 2", col=drug_as_factor, main = "PC 1 & 2 of the fold change ")
legend("right",inset = c(-0.3,0), levels(drug_as_factor), xpd = TRUE,
pch=20, col = levels)
#scree plot to show how much variance the PCs explain
plot(pca_FC, type = "l", main = "PCAs of the fold change")
A scree plot shows, how much variance a Principal Component explains. As you can see, the knick in the curve is at PC 2.
#plot PC 3 & 4 of FC associated to drug
par(xpd=T, mar=par()$mar+c(0,0,0,5))
plot(pca_FC$rotation[, 3], pca_FC$rotation[, 4], pch = 20, xlab = "PC 3",ylab = "PC 4", col=drug_as_factor, main = "PC 3&4 of the fold change ")
legend("right", inset=c(-0.25,0),levels(drug_as_factor), pch=20,col = levels, xpd =TRUE)
#pca of untreated assigned to tissue
tissue_as_factor <- as.factor(metadata[1:ncol(treated), "tissue"])
levels_t <- as.factor(levels(tissue_as_factor))
plot(pca_untreated$rotation[, 1], pca_untreated$rotation[, 2], col = metadata[1:ncol(untreated), "tissue"], pch = 20, xlab = "PC1",
ylab = "PC2", main = "PC 1&2 of gene expression without treatment")
legend("right", inset=c(-0.25,0),levels(tissue_as_factor), pch=20,
col = levels_t, xpd =TRUE)
#pca 1&2 of the fold change tissue associated
plot(pca_FC$rotation[, 1], pca_FC$rotation[, 2], col = metadata[1:ncol(untreated), "tissue"], pch = 20, xlab = "PC 1",
ylab = "PC 2", main = "PC 1&2 of the fold change")
legend("right", inset=c(-0.25,0),levels(tissue_as_factor), pch=20,
col = levels_t, xpd =TRUE)
After the PCA of the fold change, you can see some clusters. So the fold change is drug associated, which shows that the drug treatment influences the gene expression. On the other hand the gene expression of the untreated cells is tissue associated, which also makes sense because same tissue means also a similar gene expression due to same house-keepinge genes for example. What we cannot see is that the FC is associated to the tissue, which also makes sense because the PCA is carried out for all drugs. If we would have more samples treated with the same drug, maybe we then could see a relation to the tissue.
A T-test is a statistical hypothesis test which determines if there is a significant difference between the means of two groups. Here our 2 groups are the set of treated and untreated data. Since each piece of treated data corresponds to one piece of untreated data, a paired T-test is carried out using the following Hypothesis:
The H0 Hypothesis: there is no change in the mean of the treated and untreated data.
Alternative Hypothesis (H1): there is a difference in mean between the treated and untreated data.
Chosen limit: when pValue < 0.05, H0 is rejected
However, normalised data should be used for the T-test. The data is normalised and illustrated in Q-Q plots as shown below.
xGem= grep("gemcitibine",colnames(treated))
Treated_Cells_Gemcitabine= treated[,xGem]
yGem= grep("gemcitibine",colnames(untreated))
Untreated_Cells_Gemcitabine=untreated[,yGem]
Treated_Gemcitabine_Normalised=normalize(Treated_Cells_Gemcitabine)
Untreated_Gemcitabine_Normalised=normalize(Untreated_Cells_Gemcitabine)
par(mfrow=c(1,2))
qqPlot(Treated_Gemcitabine_Normalised, main="Q-Q Plot Normalised treated data", ylab= "Normalised treated Gemcitabine data", id=FALSE)
qqPlot(Untreated_Gemcitabine_Normalised, main="Q-Q Plot Normalised untreated data", ylab= "Normalised untreated Gemcitabine data",id= FALSE )
commondataframe=data.frame(Untreated_Cells_Gemcitabine,Treated_Cells_Gemcitabine)
Column_untreated=grep("_0nM",colnames(commondataframe))
Column_treated=grep("_2000nM",colnames(commondataframe))
pValues <- apply(commondataframe, 1, function(x) t.test(x[Column_untreated],x[Column_treated],paired = TRUE, alternative = "two.sided")$p.value)
commondataframe_pvalue <- data.frame(commondataframe, pValues)
sum(pValues < 0.05)
## [1] 6744
6744 genes have a p value less than 0.05. Thus, this leads to the rejection of the null hypothesis and assumption that there could be a difference between the treated and untreated data for these particular genes.
In order to assess the effect that gemcitabine has in the cancer cells, we will define certain genes, that deliver the most information. To this end, we will base the selection of the genes on their expression, or more specifically, the change in their expression before and after the treatment with the drug. These will be our biomarkers.
The first step is to calculate a the absolute value of the fold change, to find the difference in expression of every single gene between treated and untreated cell lines. This way we can detect the biggest changes, regardles off if the expession of the genes was up- or down-regulated.
absvalfcgeneexpression <- abs(fcgeneexpression)
Taking only the cell lines in the fold change matrix that were treated with gemcitabine
absvalfcge_gemcitabine <- dplyr::select(as_tibble(absvalfcgeneexpression), contains("gemcitibine"))
absvalfcge_gemcitabine <- as.matrix(absvalfcge_gemcitabine)
rownames(absvalfcge_gemcitabine) <- rownames(absvalfcgeneexpression)
Create a new matrix with the top differentially expressed genes in descending order
orderforvectors <- function(x) {
y <- names(x[order(-x)])
return(y)
}
#The new function is applied to the columns of the fold change matrix
matrixofbiomarkers <- apply(absvalfcge_gemcitabine, 2, orderforvectors)
Check for the most common genes in the first 15 rows of the new matrix
topgenes <- sort(table(as.vector(matrixofbiomarkers[1:15,])))
matrixtopgenes <- as.matrix(topgenes[order(-topgenes)])
barplot(matrixtopgenes[1:30,], xlab = "", ylab = "Counts",
main = "Top differentially expressed genes",
cex.names = 0.7, las = 2, col = "#00CC66")
mtext("Genes", side = 1, line = 4)
We decided to take the top 17 genes with the highest difference in expression before and after treatment as our biomarkers.
List of our biomarkers:
biomarkers <- names(matrixtopgenes[1:17,1])
biomarkers
## [1] "ATF3" "CXCL8" "GADD45A" "TNFAIP3" "CDKN1A" "GDF15" "BTG2"
## [8] "EGR1" "GEM" "IL11" "TNFRSF9" "CEACAM1" "ANKRD1" "DHRS2"
## [15] "BIRC3" "TP53I3" "TXNIP"
Now that we know which genes have the highest difference in gene expression, we will check if each of them is up- or down-regulated.
fcge_gemcitabine <-
dplyr::select(as_tibble(fcgeneexpression), contains("gemcitibine"))
fcge_gemcitabine <- as.matrix(fcge_gemcitabine)
rownames(fcge_gemcitabine) <- rownames(fcgeneexpression)
fcge_gemcitabine <- fcge_gemcitabine[which(rownames(fcge_gemcitabine) %in% biomarkers),]
meanfcge_gemcitabine <- apply(fcge_gemcitabine, 1, mean)
meanfcge_gemcitabine <- meanfcge_gemcitabine[order(-meanfcge_gemcitabine)]
barplot(meanfcge_gemcitabine, xlab = "", ylab = "Fold change", ylim = c(-1, 3),
main = "Mean change in gene expression of the biomarkers with gemcitabine",
cex.names = 0.9, las = 2, col = "#00CC66")
mtext("Biomarkers", side = 1, line = 4)
Apparently most of the biomarkers are up-regulated after the treatment with gemcitabine. Only DHRS2 is down-regulated.
To further understand the effects of gemcitabine in the cells, we decided to compile the main functions of each biomarker. The information about each gene was taken from the gene database of the National Center for Biotechnology Information (NCBI) of the USA.
| Biomarker | Description |
|---|---|
| ATF3 | This gene encodes for activating transcription factors. It is often highly expressed in cancer cells and is involved in the complex process of cellular stress response. |
| CXCL8 | It codes for Interleukin-8, which induces the migration of immune cells towards sites of infection, stimulates phagocytisis and promotes angiogenesis. |
| GADD45A | Member of a group of genes whose transcription increases when the cell is under stress or being treated with DNA-damaging agents, such as gemcitabine. |
| TNFAIP3 | Encodes for TNF alpha induced protein 3, which is involved in the cytikine-mediated immune and inflammatory responses. |
| CDKN1A | This gene codes for a cyclin-dependent kinase inhibitor, which plays a role in the regulation of the cell cycle, including DNA replication and DNA damage repair. Since gemcitabine produces broken DNA strands, it makes sense that this DNA-damage-repair protein would be up-regulated in cancer cells to counteract the drug’s effects. |
| GDF15 | Encodes for a TGF-beta ligand, which regulates gene expression. Increased levels of the protein are linked with inflammation and oxidative stress. |
| BTG2 | Encodes for a protein of the BTG/Tob family, which appears to have antiproliferative effects. |
| EGR1 | Encodes for a nuclear protein and transcription regulator; it works as a cancer suppressor gene. |
| GEM | This gene encodes for a GTP-binding protein, which could have a regulatory function in receptor-mediated signal transduction. |
| IL11 | Encodes for Interleukin-11, which stimulates the development of immunoglobulin-dependent B-cells. |
| TNFRSF9 | Encodes for a protein of the TNF-receptor superfamily, which contributes to the clonal expansion, survival and development of T-cells. |
| CEACAM1 | Encodes for a cell-cell adhesion molecule that plays a role in angiogenesis, apoptosis, tumor suppression and metastasis. |
| ANKRD1 | Encodes for a transcription factor, and its expression is induced bu IL-1 and TNF-alpha. |
| DHRS2 | Encodes for a dehydrogenase/reductase of a family of enzymes known to metabolize different compounds such as prostaglandins, retinoids, lipids and xenobiotics. |
| BIRC3 | Encodes for a protein that inhibits apoptosis by binding to TNF receptor-associated factors. |
| TP53I3 | The expression of this gene is induced by the tumor suppressor p53 and is thought to be involved in p53-mediated cell death. |
| TXNIP | Encodes for a thioredoxin-binding protein, which inhibits thioreduxin and results in the accumulation of reactive oxigen species in the cell and increased oxidative stress. It is suspected to function as a tumor-suppressor gene. |
Most of the biomarkers found are up-regulated after the cell is treated with gemcitabine. When looking at the functions that these genes have, it appears that this up-regulation would be beneficial for the treatment of cancer, since several of them are tumor-suppressor genes or have immune-activating effects. The one biomarker that is down-regulated (DHRS2) is an enzyme that metabolzes xenobiotics, so its reduced expression is also beneficial, since it would presumably decrease the speed of degradation of gemcitabine.
MSI is defined as a deficit in the MMR (Mismatch Repair system). The defect is caused by a downregulation in the gene MLH1 and MSH6. This result in more mistakes while transcribing the DNA, so more mutations will occur. If we plot the number of mutations against the celllines, MSI-H celllines should have more mutations. To check this, we do the following plot.
mutation_cellline <- mutations[,"Tumor_Sample_Barcode"]
counting_mutations <- plyr::count(mutation_cellline, vars = NULL, wt_var = NULL)
#creating a new dataframe with the Celllines and high/low Microsatellite Instablility.
#TRUE:High
#FALSE:Stable/Low
L=cannotation$Microsatellite_instability_status=="MSI-H"
Cell_Line=data.frame("Cell Line name"= cannotation[,"Cell_Line_Name"],"Microsatellite"= L)
Cell_Line <- Cell_Line[order(Cell_Line$Cell.Line.name),]
drugsensitivity_transposed=t(drugsensitivity)
#The Cell_Line dataframe is directly merged with the Gemcitabine Drug Sensitivity Values.
Cell_Line_no_nas <- na.omit(Cell_Line)
table_of_mutations_na <- qpcR:::cbind.na(counting_mutations, Cell_Line_no_nas)
table_of_mutations <- na.omit(table_of_mutations_na)
mutation_plot <- qplot(
x = table_of_mutations[,3],
y = table_of_mutations[,2],
data = table_of_mutations,
color = Microsatellite,
xlab = "Cell line",ylab = "Mutations")
mutation_plot + ggtitle("Number of mutations per cellline coloured by MSI
status")+ xlab("Cell Line")+ylab("Mutations")+theme(axis.text.x = element_text(angle = 90, size = 7))
It can be observed from the previous plot that the MSI-H cells tend to have a higher number of mutations than MSI-L cells. From this observation arises the question of whether the MSI status also affect the sensitivity of cells to gemcitabine.
MSI-H cancers have a better prognosis when it comes to immune activating therapies, because of new immunogenic antigens being produced. However, the question remains open, of whether the Microsatellite Instability also affect other non-immune mechanisms in the cell. The deficit in the MMR could result in faulty essential proteins being produced, which would lead to cell death, but the higher mutation rate could also lead to a faster development of drug escape mechanisms.
In order to investigate the relationship between sensitivity to gemcitabine and the MSI status, we plot the EC50 value of all cell lines and colour them according to the MSI status.
# Plotting the GI50 values relating to Gemcitabine treatment .
gemcitabine_line=data.frame("Gemcitabine drug sensitivity"= drugsensitivity_transposed[,"gemcitibine"])
everything=na.omit(cbind(Cell_Line,gemcitabine_line))
basic=ggplot(everything, aes(Cell.Line.name, Gemcitabine.drug.sensitivity, colour = factor(Microsatellite), shape = factor(Microsatellite) )) +
geom_point()
basic + ggtitle("Drug Sensitivity of Gemcitabine treated cell lines according to MSI status")+ xlab("Cell Line")+ylab("Drug Sensitivity")+theme(axis.text.x = element_text(angle = 90, size = 7))
A relationship between high drug sensitivity and high microsatellite instability cannot be concluded from this plot.
To investigate whether MSI status affects the fold change after gemcitabine treatment, we explore the fold change of our 17 biomarkers for gemcitabine in the MSI-High and MSI-Low celllines. Furthermore, the question arises of whether the gene expression of key genes involved in the DNA repair mechanisms is also affected. To this end, we picked MLH1 and MSH6, which should theoretically be downregulated.
#Select the columns names that contain gemcitabine
gmci <- grepl('gemcitibine', colnames(fcgeneexpression))
#rows 365:420(gemcitibine), make dataframe with gemci and biomarkers
colnames(fcgeneexpression) = metadata[1:ncol(treated),"cell"]
biomarkers_geneexpression <- fcgeneexpression[c("ATF3", "CXCL8", "GADD45A", "TNFAIP3", "CDKN1A", "GDF15", "BTG2", "EGR1", "GEM", "IL11", "TNFRSF9", "CEACAM1", "ANKRD1", "DHRS2", "BIRC3", "TP53I3", "TXNIP"), gmci]
#MSI related genes, choosen after research, new matrix for untreated and FC
geneexpression_MSI_related <- fcgeneexpression[c("MLH1", "MSH6"), 365:420]
geneexpression_MSI_related_untreated <- untreated_normalized[c("MLH1", "MSH6"), 365:420]
colnames(geneexpression_MSI_related_untreated) <- metadata[365:420, "cell"]
#which cellines MSI-H or L?
#View(Cell_Line)
#MSI-H in col = 8,12,11,17,19,27,42,45 of geneexpression MSI related which is only gemcitibine specific
#define new matrices to divide matrices of FC and untreated in MSI-H and MSI-L
biomarkers_geneexpression_MSIH <- biomarkers_geneexpression[,c(8,12,11,17,19,27,42,45)]
biomarkers_geneexpression_MSIL <- biomarkers_geneexpression[,-c(8,12,11,17,19,27,42,45)]
geneexpression_MSI_related_untreated_MSIH <- geneexpression_MSI_related_untreated[, c(8,12,11,17,19,27,42,45)]
geneexpression_MSI_related_untreated_MSIL <- geneexpression_MSI_related_untreated[, -c(8,12,11,17,19,27,42,45)]
#heatmap of untreated geneexpression with MSI related genes
pheatmap(cbind(geneexpression_MSI_related_untreated_MSIH, geneexpression_MSI_related_untreated_MSIL), cluster_rows = FALSE, cluster_cols = FALSE, gaps_col = 8, cellheight = 20, name = "MLH1 and MSH6")
In the previous graph can be observed that there is a difference in the expression of these two genes in MSI-H and MSI-L cancer cells.
Next we investigate if the MSI status affects the expression of our biomarkers after gemcitabine treatment.
#Two heatmaps of the biomarkers' FC in MSI-Low and MSI-High celllines
ht_biomarkers_geneexpression_MSIH = Heatmap(biomarkers_geneexpression_MSIH,
name = "FC MSI-High", column_title = "MSI-H", column_names_gp = gpar(fontsize = 5))
ht_biomarkers_geneexpression_MSIL = Heatmap(biomarkers_geneexpression_MSIL,
name = "FC MSI-Low", column_title = "MSI-L", column_names_gp = gpar(fontsize = 5))
ht_biomarkers_geneexpression_MSIH + ht_biomarkers_geneexpression_MSIL
No conclusion can be drawn from the heatmaps above, of whether the gene expression of the biomarkers in MSI-H and MSI-L cells differs importantly.
In this part it is invastigated how the gemcitabine sensitivity is dependend on the cellline and further dependend on the different types of somatic mutations.
In order to investigate the difference in gemcitabine sensitivity for different cell lines a mean graph plot is used.
#calculation of EC50 mean for gemcitabine
rowMeans(drugsensitivity, na.rm = TRUE, dims = 1)
The EC50 mean value is 7,003322 for gemcitabine
#creating a vector for EC50 values of gemcitabine only
NegLogGI50gemcitabine<- drugsensitivity[-1:-7,]
NegLogGI50gemcitabine2<- NegLogGI50gemcitabine[-2:-8,]
#substract the value 7,003322 for EC50 value of gemcitabine for every cell line
EC50value <- NegLogGI50gemcitabine2 - 7.003322
Neg_LogGI50gemcitabine4 <- as.data.frame(EC50value)
Neg_LogGI50gemcitabine4$`cell_line` <- rownames(Neg_LogGI50gemcitabine4)
Neg_LogGI50gemcitabine4$EC50value_type<- ifelse(Neg_LogGI50gemcitabine4$EC50value< 0,"below", "above")
set.text <- element_text(face = "bold.italic", color = "black", size = 3)
ggplot(Neg_LogGI50gemcitabine4, aes(x=`cell_line`, y=EC50value, label=EC50value)) +
theme(axis.text = set.text)+
geom_bar(stat='identity', aes(fill=EC50value_type), width=0.7) +
scale_fill_manual(name="value",
labels = c("Above Average", "Below Average"),
values = c("above"="#00ba38", "below"="#f8766d")) +
labs(subtitle=" ",
title= "Cell line and drug sensitivity") +
coord_flip()
From the mean graph plot it can be observed that the different celllines had differences in gemcitabine sensitivity. The next step is to investigate how the count of different somatic mutation types differ for celllines with a high and a low gemcitabine sensitivity.
#select celllines with 10 highest and 10 lowest EC50 values
sort(NegLogGI50gemcitabine2, decreasing = FALSE)
## HCC-2998 HS-578T OVCAR-4 TK-10 PC-3 MDA-MB-231
## 5.000 5.000 5.000 5.000 5.096 5.139
## SK-MEL-28 OVCAR-3 KM12 HCT-15 NCI-H522 T-47D
## 5.270 5.441 5.547 5.658 5.870 5.997
## COLO-205 RPMI-8226 SK-MEL-2 SW-620 NCI-H322M HOP-92
## 6.246 6.251 6.407 6.466 6.528 6.652
## SK-OV-3 OVCAR-5 NCI-H226 MALME-3M K-562 BT-549
## 6.668 6.740 6.836 7.054 7.084 7.129
## IGR-OV1 EKVX UACC-257 SR A498 MOLT-4
## 7.152 7.167 7.212 7.237 7.262 7.356
## SK-MEL-5 SN12C SF-295 UACC-62 SF-268 RXF-393
## 7.406 7.408 7.410 7.417 7.422 7.440
## HL-60 LOX DU-145 CCRF-CEM SNB-75 HOP-62
## 7.544 7.545 7.563 7.572 7.588 7.594
## U251 MDA-MB-435 MDA-MB-468 UO-31 SF-539 A549
## 7.656 7.724 7.745 7.757 7.777 7.802
## NCI-ADR-RES SNB-19 OVCAR-8 CAKI-1 NCI-H23 HCT-116
## 7.842 7.882 7.914 8.034 8.103 8.127
## NCI-H460 M14 MCF7 786-0 ACHN
## 8.186 8.195 8.258 8.399 8.421
Cell lines with the highest EC50 values are SNB-19, OVCAR-8 ,CAKI-1,NCI-H23 ,HCT-116, NCI-H460 ,M14,MCF7 ,786-0,ACHN. Since SNB-19 is not part of the “mutations” data I am going to proceed with the nine remaining celllines. Cell lines with the lowest EC50 values are HCC-2998, HS-578T, OVCAR-4 , TK-10, PC-3, MDA-MB-231, SK-MEL-28 , OVCAR-3 ,KM12, HCT-15.
celllineshigh <- c("OVCAR-8","CAKI-1","NCI-H23","HCT-116","NCI-H460","M14",
"MCF7","786-0","ACHN")
celllineslow <- c("HCC-2998","HS-578T","OVCAR-4","TK-10","PC-3","MDA-MB-231",
"SK-MEL-28","OVCAR-3","KM12","HCT-15")
Missense mutation for celllines with high EC50 values
missensehigh = c()
for (i in (celllineshigh) ) {
b= (sum(mutations[
which(mutations$Tumor_Sample_Barcode == i),"Variant_Classification"]
== "Missense_Mutation"))
missensehigh = c(missensehigh, b)}
Missense mutation for celllines with low EC50 values
missenselow = c()
for (i in (celllineslow) ) {
a = (sum(mutations[
which(mutations$Tumor_Sample_Barcode == i),"Variant_Classification"]
== "Missense_Mutation"))
missenselow = c(missenselow, a)}
Silent mutation for celllines with high EC50 values
silenthigh = c()
for (i in (celllineshigh) ) {
c =(sum(mutations[
which(mutations$Tumor_Sample_Barcode == i),"Variant_Classification"]
== "Silent"))
silenthigh = c(silenthigh, c)}
Silent mutation for celllines with low EC50 values
silentlow = c()
for (i in (celllineslow) ) {
d = (sum(mutations[
which(mutations$Tumor_Sample_Barcode == i),"Variant_Classification"]
== "Silent"))
silentlow = c(silentlow, d)}
Nonsense mutation for celllines with high EC50 values
nonsensehigh =c()
for (i in (celllineshigh) ) {
e =(sum(mutations[
which(mutations$Tumor_Sample_Barcode == i),"Variant_Classification"]
== "Nonsense_Mutation"))
nonsensehigh= c(nonsensehigh, e)}
Nonsense mutation for celllines with low EC50 values
nonsenselow =c()
for (i in (celllineslow) ) {
f= (sum(mutations[
which(mutations$Tumor_Sample_Barcode == i),"Variant_Classification"]
== "Nonsense_Mutation"))
nonsenselow= c(nonsenselow, f)}
Splice site mutation for celllines with high EC50 values
splicesitehigh = c()
for (i in (celllineshigh) ) {
g =(sum(mutations[
which(mutations$Tumor_Sample_Barcode == i),"Variant_Classification"]
== "Splice_Site"))
splicesitehigh = c(splicesitehigh, g)}
Splice site mutation for celllines with low EC50 values
splicesitelow = c()
for (i in (celllineslow) ) {
h = (sum(mutations[
which(mutations$Tumor_Sample_Barcode == i),"Variant_Classification"]
== "Splice_Site"))
splicesitelow = c(splicesitelow, h)}
Frame shift deletion for celllines with high EC50 values
frameshifthigh = c()
for (i in (celllineshigh) ) {
j= (sum(mutations[
which(mutations$Tumor_Sample_Barcode == i),"Variant_Classification"]
== "Frame_Shift_Del"))
frameshifthigh = c(frameshifthigh, j)}
Frame shift deletion for celllines with low EC50 values
frameshiftlow = c()
for (i in (celllineslow) ) {
k = (sum(mutations[
which(mutations$Tumor_Sample_Barcode == i),"Variant_Classification"]
== "Frame_Shift_Del"))
frameshiftlow = c(frameshiftlow, k)}
Frame shift insertion for celllines with high EC50 values
frameshiftinshigh= c()
for (i in (celllineshigh) ) {
l= (sum(mutations[
which(mutations$Tumor_Sample_Barcode == i),"Variant_Classification"]
== "Frame_Shift_Ins"))
frameshiftinshigh= c(frameshiftinshigh,l)}
Frame shift insertion for celllines with low EC50 values
frameshiftinslow = c()
for (i in (celllineslow) ) {
p= (sum(mutations[
which(mutations$Tumor_Sample_Barcode == i),"Variant_Classification"]
== "Frame_Shift_Ins"))
frameshiftinslow = c(frameshiftinslow, p)}
De novo out of frame mutation for celllines with high EC50 values
denovohigh = c()
for (i in (celllineshigh) ) {
u= (sum(mutations[
which(mutations$Tumor_Sample_Barcode == i),"Variant_Classification"]
== "De_novo_Start_OutOfFrame"))
denovohigh= c(denovohigh, u)}
De novo out of frame mutation for celllines with low EC50 values
denovolow = c()
for (i in (celllineslow) ) {
r= (sum(mutations[
which(mutations$Tumor_Sample_Barcode == i),"Variant_Classification"]
== "De_novo_Start_OutOfFrame"))
denovolow = c(denovolow, r)}
Matrix for celllines with high EC50 values
mutationtype <- c("missense", "silent", "nonsense","splicesite","frame_shift_deletion",
"frame_shift_insertion","de_novo_out_of_frame")
HighEC50 <- cbind(missensehigh,silenthigh,nonsensehigh,splicesitehigh,
frameshifthigh,frameshiftinshigh,denovohigh)
dimnames(HighEC50) = list(
c("OVCAR-8","CAKI-1","NCI-H23","HCT-116","NCI-H460","M14","MCF7","786-0","
ACHN"),
mutationtype)
Matrix for celllines with low EC50 values
LowEC50 <- cbind(missenselow,silentlow,nonsenselow,splicesitelow,
frameshiftlow,frameshiftinslow,denovolow)
dimnames(LowEC50) = list(
c("HCC-2998","HS-578T","OVCAR-4","TK-10","PC-3","MDA-MB-231","SK-MEL-28",
"OVCAR-3","KM12","HCT-15"),
mutationtype)
Heatmaps
pheatmap(HighEC50,color = colorRampPalette(c("blue","white","purple",
"firebrick3","red","orange"))(10), breaks = c(4.8,10.7,15.0,19.0,26.4,66.6,149.0,
390.3,2229.0), main= "Celllines with high EC50 values")
pheatmap(LowEC50,color = colorRampPalette(c("blue","white","purple","firebrick3",
"red","orange"))(10), breaks = c(4.0,6.8,10.4,14.0,22.0,74.8,142.9,
198.0,564.1,6716.0), main= "Celllines with low EC50 values")
EC50 is the concentration for 50% of maximal inhibition of cell proliferation. Higher values indicate higher sensitivity. The most prominent difference in somatic mutation pattern between the celllines with high and low gemcitabine sensitivity is that the celllines with a low EC50 value have a much higher count of mutations, which could indicate that the count of mutations has an impact on gemcitabine sensitivtiy. Further we are going to look at the portion of different mutation types. In the heatmap it is visible that for most mutation types the mutation count pattern is similar. However, for nonsense and splice site mutation you can see that the celllines with low EC50 values have a higher portion in those mutations in comparison to the celllines with high EC50 values. It is maybe linked to the fact that gemcitabine is an antimetabolite and stops DNA synthesis by inducing a masked chain termination. Nonsense mutations have the same effect of inducing a stop of DNA synthesis. Celllines in which a stop of DNA synthesis occur more often due to nonsense mutation are maybe therefore less sensitive to gemcitabine.
However mutation frequency or type alone is insufficient to identify and predict the drug sensitivity of a cellline towards gemcitabine.
A more effective way to explain a celllines sensitivity to gemcitabine are mutations of genes that play an important role in the mechanisms gemcitabine is involved in. One enzyme that plays an important role in converting gemcitabine from the prodrug into the final active drug form is deoxycytidine kinase (DCK) . It modifies gemcitabine by attaching a phosphate to it, converting it to gemcitabine monophosphate. The gene encoding this enzyme is called DCK.
print(mutations[which(mutations$Hugo_Symbol == "DCK"),"Tumor_Sample_Barcode"])
## [1] "HCC-2998"
The only cellline that has a DCK mutation is “HCC-2998” and it corresponds to the fact that it is also a cellline with a low gemcitabine sensitivity.
As seen previously, the analyzed celllines have different sensitivities to the treatment with gemcitabine. It would be useful to be able to predict the cell’s response to the drug before starting the treatment and be able to apply this knowledge in the real world. The ultimate goal would be to be able to predict the response of a patient’s cancer to the drug, so as to better decide if it is the ideal therapy in each case. To this end, we will use both linear and multiple regression analyses.
This linear regression will attempt to find a relation between:
1. The drug sensitivity of the celllines (EC50 value) and the gene expression of their biomarkers when untreated.
2. The drug sensitivity and the gene copy number of the biomarkers.
The first step is to check the correlation between the gene expression and gene copy number of the biomarkers, to decide if it makes sense to compare them separately to the drug sensitivity or if it is enough to compare only one of the two.
The biomarkers were selected from the treated and untreated matrices, so first we make sure that they are in the copy number matrix too.
biomarkers %in% rownames(copynumber)
## [1] TRUE FALSE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [12] TRUE TRUE TRUE TRUE TRUE TRUE
The second biomarker is not in this matrix, so it has to be removed from the gene expression one as well.
commonbiomarkers.lr <- intersect(rownames(copynumber), biomarkers)
#There is no need to intersect with the untreated matrix, because it was used to define the biomarkers, so it definitely contains all of them.
Keep only the common biomarkers in both matrices and order alphabetically:
cnforlr <- copynumber[order(rownames(copynumber)),]
cnforlr <- cnforlr[which(rownames(cnforlr) %in% commonbiomarkers.lr),]
untreatedforlr <- untreated_normalized[order(rownames(untreated_normalized)),]
untreatedforlr <- untreatedforlr[which(rownames(untreatedforlr) %in% commonbiomarkers.lr),]
The matrix for untreated has many more columns than the one of the copy number because it contains a measurement for each cellline before starting the treatment with each drug. Since no drug has been applied to any of them, all columns regarding a given cellline can be merged, and the information of drug, dose and time in the column name can be eliminated.
colnames(untreatedforlr) <- sub("_.*", "", colnames(untreatedforlr))
untreatedforlr <- sapply(split(seq_len(ncol(untreatedforlr)),colnames(untreatedforlr)),
function(cis) rowMeans(untreatedforlr[,cis,drop=F]))
dim(untreatedforlr)
## [1] 16 60
The matrix for untreated still lists more columns than the gene copy number, so only the common ones are to be kept, and we order them alphabetically:
commoncelllines.lr <- intersect(colnames(untreatedforlr), colnames(cnforlr))
untreatedforlr <- untreatedforlr[,which(colnames(untreatedforlr) %in% commoncelllines.lr)]
untreatedforlr <- untreatedforlr[,order(colnames(untreatedforlr))]
cnforlr <- cnforlr[,which(colnames(cnforlr) %in% commoncelllines.lr)]
cnforlr <- cnforlr[,order(colnames(cnforlr))]
The Spearman correlation of the copy number and gene expression of each biomarker is calculated.
cnforlr.t <- t(cnforlr)
untreatedforlr.t <- t(untreatedforlr)
correlations.cn.genexp <- c()
for (i in 1:ncol(cnforlr.t)) {
cor.i <- cor(cnforlr.t[,i], untreatedforlr.t[,i], method = "spearman")
correlations.cn.genexp <- c(correlations.cn.genexp, cor.i)
}
names(correlations.cn.genexp) <- colnames(untreatedforlr.t)
barplot(correlations.cn.genexp, ylim = c(-1,1), las = 2, cex.names = 0.8, col = "#3399FF",
ylab = "Correlation coefficient",
main = "Correlation between gene expression and copy number" )
mtext("Genes", side = 1, line = 4.3)
The correlation coefficient of all biomarkers’ gene copy number and gene expression are very close to 0, which means that there is a very low correlation between the two variables analyzed.
This low correlation was expected,since there are several mechanisms that control cells’ gene expression. Epigenetic variations in the different tissues, like the DNA methylation can silence the gene expression, and transcription factors also play an important role in the up- or down-regulation of gene expression. Therefore, taking only the gene copy number would harldy be enough to predict the expression of a given gene.
Since the correlation of the two variables is so low, we will proceed with separate linear regressions for:
1. Gene expression vs. drug sensitivity
2. Gene copy number vs. drug sensitivity
The drug sensitivity matrix contains the EC50 values of several celllines when exposed to different drugs. First of all, we make sure to leave only the celllines that are contained in the other two matrices for the regression.
dsforlr <- drugsensitivity[,which(colnames(drugsensitivity) %in% commoncelllines.lr)]
dsforlr <- dsforlr[,order(colnames(dsforlr))]
dsforlr <- dsforlr[order(rownames(dsforlr)),]
Create a matrix with the gene expression of the biomarkers and a row with the drug sensitivity of each cellline, to be used for the linear regressions:
genexp.drugsens <- rbind(untreatedforlr, dsforlr["gemcitibine",])
rownames(genexp.drugsens) <- c(rownames(genexp.drugsens)[1:nrow(genexp.drugsens)-1],
"DS_Gemcitabine")
genexp.drugsens <- t(genexp.drugsens)
Linear regression of the first biomarker (ATF3):
lrATF3genexp <- lm(DS_Gemcitabine~ATF3, data = as.data.frame(genexp.drugsens))
summary(lrATF3genexp)
##
## Call:
## lm(formula = DS_Gemcitabine ~ ATF3, data = as.data.frame(genexp.drugsens))
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.0193 -0.5833 0.2552 0.6949 1.4093
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.0249 0.1361 51.628 <2e-16 ***
## ATF3 0.1382 0.2809 0.492 0.625
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.9687 on 50 degrees of freedom
## (1 observation deleted due to missingness)
## Multiple R-squared: 0.004816, Adjusted R-squared: -0.01509
## F-statistic: 0.242 on 1 and 50 DF, p-value: 0.6249
plot(DS_Gemcitabine~ATF3, data = as.data.frame(genexp.drugsens),
ylab = "EC50", main = "Linear regression with gene expression and EC50")
abline(lm(DS_Gemcitabine~ATF3, data = as.data.frame(genexp.drugsens)), col = "#3399FF")
R-squared values of the linear regression of each biomarker:
DS_gemcitabine <- genexp.drugsens[,"DS_Gemcitabine"]
all.r.squared.ge <- c()
for (i in 1:length(commonbiomarkers.lr)) {
r.squared.ge.i <-
summary(lm(DS_gemcitabine~genexp.drugsens[,i],
data = as.data.frame(genexp.drugsens)))$r.squared
all.r.squared.ge <- c(all.r.squared.ge, r.squared.ge.i)
}
names(all.r.squared.ge) <- colnames(genexp.drugsens[,1:ncol(genexp.drugsens)-1])
barplot(all.r.squared.ge, ylim = c(0,1), ylab = "R-squared value",
main = "R-squared values of the linear regressions using gene expression",
cex.names = 0.7, las = 2, col = "#3399FF")
mtext("Biomarkers", side = 1, line = 4)
The first step is to create a matrix with the gene copy number of the biomarkers and a row with the drug sensitivity of each cellline, just like it was done with the gene expression:
copynum.drugsens <- rbind(cnforlr, dsforlr["gemcitibine",])
rownames(copynum.drugsens) <- c(rownames(copynum.drugsens)[1:nrow(copynum.drugsens)-1],
"DS_Gemcitabine")
copynum.drugsens <- t(copynum.drugsens)
Linear regression of the first biomarker (ATF3):
lrATF3copynum <- lm(DS_Gemcitabine~ATF3, data = as.data.frame(copynum.drugsens))
summary(lrATF3copynum)
##
## Call:
## lm(formula = DS_Gemcitabine ~ ATF3, data = as.data.frame(copynum.drugsens))
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.1726 -0.5876 0.2784 0.6193 1.4727
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.0419 0.1377 51.155 <2e-16 ***
## ATF3 -0.2770 0.3278 -0.845 0.402
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.9642 on 50 degrees of freedom
## (1 observation deleted due to missingness)
## Multiple R-squared: 0.01408, Adjusted R-squared: -0.005638
## F-statistic: 0.7141 on 1 and 50 DF, p-value: 0.4021
plot(DS_Gemcitabine~ATF3, data = as.data.frame(copynum.drugsens),
ylab = "EC50", main = "Linear regression with gene copy number and EC50")
abline(lm(DS_Gemcitabine~ATF3, data = as.data.frame(copynum.drugsens)), col = "#3399FF")
R-squared values of the linear regression of each biomarker:
all.r.squared.cn <- c()
for (i in 1:length(commonbiomarkers.lr)) {
r.squared.i.cn <-
summary(lm(DS_gemcitabine~copynum.drugsens[,i],
data = as.data.frame(copynum.drugsens)))$r.squared
all.r.squared.cn <- c(all.r.squared.cn, r.squared.i.cn)
}
names(all.r.squared.cn) <- colnames(copynum.drugsens[,1:ncol(copynum.drugsens)-1])
barplot(all.r.squared.cn, ylim = c(0,1), ylab = "R-squared value",
main = "R-squared values of the linear regressions using gene copy number",
cex.names = 0.7, las = 2, col = "#3399FF")
mtext("Biomarkers", side = 1, line = 4)
The linear regressions showed that the gene expression or copy number of each biomarker by itself does not account for the drug sensitivity of the cellline. Therefore, we will inspect if a multiple regression using all 17 biomarkers can be used to better predict a cellline’s drug sensitivity.
multipleregression.ge <- lm(DS_Gemcitabine~ATF3 + GADD45A + TNFAIP3 + CDKN1A + GDF15 + BTG2 + EGR1
+ GEM + IL11 + TNFRSF9 + CEACAM1 + ANKRD1 + DHRS2 + BIRC3 + TP53I3
+ TXNIP, data = as.data.frame(genexp.drugsens))
summary(multipleregression.ge)
##
## Call:
## lm(formula = DS_Gemcitabine ~ ATF3 + GADD45A + TNFAIP3 + CDKN1A +
## GDF15 + BTG2 + EGR1 + GEM + IL11 + TNFRSF9 + CEACAM1 + ANKRD1 +
## DHRS2 + BIRC3 + TP53I3 + TXNIP, data = as.data.frame(genexp.drugsens))
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.7656 -0.4278 0.1006 0.4659 1.3952
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 8.43403 0.93532 9.017 1.18e-10 ***
## ATF3 -0.16245 0.36937 -0.440 0.6628
## GADD45A -0.20263 0.41093 -0.493 0.6250
## TNFAIP3 -0.22199 0.37392 -0.594 0.5565
## CDKN1A 0.36271 0.41575 0.872 0.3889
## GDF15 0.14466 0.26645 0.543 0.5906
## BTG2 0.07558 0.51139 0.148 0.8834
## EGR1 -0.15467 0.29818 -0.519 0.6072
## GEM -0.16369 0.30821 -0.531 0.5987
## IL11 0.40809 0.46116 0.885 0.3822
## TNFRSF9 0.51274 0.35913 1.428 0.1622
## CEACAM1 -0.10579 0.32636 -0.324 0.7478
## ANKRD1 -0.22664 0.31811 -0.712 0.4809
## DHRS2 0.02433 0.29135 0.084 0.9339
## BIRC3 0.15333 0.26930 0.569 0.5727
## TP53I3 -0.44056 0.33406 -1.319 0.1958
## TXNIP -0.51490 0.24281 -2.121 0.0411 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.961 on 35 degrees of freedom
## (1 observation deleted due to missingness)
## Multiple R-squared: 0.3143, Adjusted R-squared: 0.000845
## F-statistic: 1.003 on 16 and 35 DF, p-value: 0.4763
This multiple regression has an R-squared value of 0.3142, which is higher than what each individual biomarker by itself had, but still not enough to be used as a predictor of the drug sensitivity of the celllines.
multipleregression.cn <- lm(DS_Gemcitabine~ATF3 + GADD45A + TNFAIP3 + CDKN1A + GDF15 + BTG2 + EGR1
+ GEM + IL11 + TNFRSF9 + CEACAM1 + ANKRD1 + DHRS2 + BIRC3 + TP53I3
+ TXNIP, data = as.data.frame(copynum.drugsens))
summary(multipleregression.cn)
##
## Call:
## lm(formula = DS_Gemcitabine ~ ATF3 + GADD45A + TNFAIP3 + CDKN1A +
## GDF15 + BTG2 + EGR1 + GEM + IL11 + TNFRSF9 + CEACAM1 + ANKRD1 +
## DHRS2 + BIRC3 + TP53I3 + TXNIP, data = as.data.frame(copynum.drugsens))
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.28480 -0.42166 0.03478 0.44688 1.23132
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.9666 0.2231 31.233 < 2e-16 ***
## ATF3 -0.9049 0.3887 -2.328 0.025838 *
## GADD45A 0.9799 0.6898 1.421 0.164272
## TNFAIP3 -0.3160 0.4101 -0.771 0.446142
## CDKN1A -1.2831 0.4106 -3.125 0.003567 **
## GDF15 0.7565 0.8654 0.874 0.387994
## BTG2 1.1294 0.8722 1.295 0.203850
## EGR1 -0.9827 0.5141 -1.911 0.064157 .
## GEM -0.3648 0.3751 -0.973 0.337455
## IL11 -2.6380 0.6240 -4.227 0.000161 ***
## TNFRSF9 -1.3399 0.4612 -2.905 0.006322 **
## CEACAM1 1.5477 0.5178 2.989 0.005090 **
## ANKRD1 -0.6009 0.5225 -1.150 0.257945
## DHRS2 -0.5643 0.3679 -1.534 0.134061
## BIRC3 -0.1238 0.2129 -0.582 0.564616
## TP53I3 0.1228 0.6466 0.190 0.850496
## TXNIP 0.1047 0.6086 0.172 0.864395
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.7838 on 35 degrees of freedom
## (1 observation deleted due to missingness)
## Multiple R-squared: 0.544, Adjusted R-squared: 0.3355
## F-statistic: 2.609 on 16 and 35 DF, p-value: 0.008807
This regression has the highest R-squared value of all, but at 0.53 it still cannot be used to predict the drug sensitivity of the celllines.
All of the linear and multiple analyses performed yielded very low R-squared values. In an ideal case, when the variable or variables (gene expression and copy number) perfectly explained the behavior of another variable (the drug sensitivity in our case), this value would be equal to 1, but the highest value we got from our regressions was 0.544.
This was not entirely surprising, since the cellular response to a drug or any outside influence is a very complex process regulated by hundreds, if not thousands, of genes, so using only our 17 biomarkers to predict the cell’s response to gemcitabine was unlikely to yield any meaningful results. Furthermore, many more variables come into play to determine the cell’s response to a drug, such as specific mutations in particular genes, which we did not take into account for our analysis.